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A priori truncation method for posterior sampling 
from homogeneous normalized completely random 
measure mixture models 

Raffaele Argiento* Ilaria Bianchini' Alessandra Guglielmi ' 

Abstract 

This paper adopts a Bayesian nonparametric mixture model where the mixing distri¬ 
bution belongs to the wide class of normalized homogeneous completely random mea¬ 
sures. We propose a truncation method for the mixing distribution by discarding the 
weights of the unnormalized measure smaller than a threshold. We prove convergence 
in law of our approximation, provide some theoretical properties and characterize its 
posterior distribution so that a blocked Gibbs sampler is devised. 

The versatility of the approximation is illustrated by two different applications. In 
the first the normalized Bessel random measure, encompassing the Dirichlet process, 
is introduced; goodness of fit indexes show its good performances as mixing measure 
for density estimation. The second describes how to incorporate covariates in the sup¬ 
port of the normalized measure, leading to a linear dependent model for regression and 
clustering. 

Keywords: Bayesian nonparametric mixture models • normalized completely random 
measure • blocked Gibbs sampler • finite dimensional approximation • a priori trun¬ 
cation method 


1 Introduction 

One of the livelier topic in Bayesian Nonparametrics concerns mixtures of parametric den¬ 
sities where the mixing measure is an almost surely discrete random probability measure. 
The basic model is what is known now as Dirichlet process mixture model, appeared first 
in Lo (1984), where the mixing measure is indeed the Dirichlet process. Dating back to 
Ishwaran and James (2001) and Lijoi et al. (2005), many alternative mixing measures have 
been proposed; the former paper replaced the Dirichlet process with stick-breaking random 
probability measures, while the latter focused on normalized completely random measures. 
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These hierarchical mixtures play a pivotal role in modern Bayesian Nonparametrics, 
since their potentialities range within many applications. Indeed, they can easily be ex¬ 
ploited in very different contexts: for instance, graphical models, topic modeling or biologi¬ 
cal applications. Their popularity is mainly due to the high flexibility in density estimation 
problems as well as in clustering, which is naturally embedded in the model. 

Often the Dirichlet Process prior is employed as mixing measure because of its mathe¬ 
matical and computational tractability: however, in some statistical applications, clustering 
induced by the Dirichlet process may be restrictive. In fact, it is well-know that the lat¬ 
ter allocates observations to clusters with probabilities depending only on the cluster sizes, 
leading to the "the rich gets richer" behavior. Within some classes of more general processes, 
as, for instance, stick-breaking and normalized processes, the probability of allocating an ob¬ 
servation to a specific cluster depends also on extra parameters, as well as on the number 
of groups and on the cluster's size. We refer to Argiento et al. (2015) for a recent review of 
state of art on Bayesian nonparametric mixture models and clustering. 

Since, when dealing with nonparametric mixtures, the posterior inference involves an 
infinite-dimensional parameter, this may lead to computational issues; this limit prevents 
applied statisticians from exploiting models beyond Dirichlet process mixtures when deal¬ 
ing with modern real-life applications. However, there is a recent and lively literature focus¬ 
ing mainly on two different classes of MCMC algorithms, namely marginal and conditional 
Gibbs samplers. The former integrate out the infinite dimensional parameter (i.e. the ran¬ 
dom probability), resorting to generalized Polya urn schemes; see Favaro and Teh (2013) or 
Lomeli et al. (2014). The latter include the nonparametric mixing measure in the state space 
of the Gibbs sampler, updating it as a component of the algorithm; this group includes the 
slice sampler (see Griffin and Walker, 2011). Among conditional algorithms there are trun¬ 
cation methods, where the infinite parameter (i.e. the mixing measure) is approximated by 
truncation of the infinite sums defining the process, either a posteriori (Argiento et al., 2010; 
Barrios et al., 2013) or a priori (Argiento et al., 2015; Griffin, 2013). 

In this work we introduce an almost surely finite dimensional class of random proba¬ 
bility measures that approximates the wide family of homogeneous normalized completely 
random measures (Regazzini et al., 2003; Kingman, 1975); we use this class as the building 
block in mixture models and provide a simple but general algorithm to perform posterior 
inference. Our approximation is based on the constructive definition of the weights of the 
completely random measure as the points of a Poisson process on JR h . In particular, we 
consider only points larger than a threshold e, controlling the degree of approximation. The 
construction given here generalizes Argiento et al. (2015) where the particular class of nor¬ 
malized generalized gamma processes was considered. Conditionally on e, our process is 
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finite dimensional either a priori and a posteriori. 

As detailed later, the two main ingredients to build a normalized completely random 
measure are p(s), s > 0, the intensity of the Poisson process determining the weights of 
the measure on the one hand, and the so-called centering measure Pq(-), characterizing the 
locations of the measure, on the other. Here we illustrate two applications. In the first, 
a new choice for p is proposed: the Bessel intensity function, that, up to our knowledge, 
has never been applied in a statistical framework, but in finance (see Barndorff-Nielsen, 
2000, for instance). On the other hand, we fix the centering measure Po to be the normal 
inverse-gamma, a conjugate choice when the kernel is Gaussian. We call this new process 
normalized Bessel random measure. In the second application, we set p to be the well- 
known generalized gamma intensity and consider a centering measure Pox depending on on 
a set of covariates x, yielding a linear dependent normalized completely random measure. 
For a recent survey on dependent nonparametric processes in the Statistics and Machine 
Learning literature see Foti and Williamson (2015). 

In this paper, since the main objective is the approximation of the nonparametric process 
arisen from the normalization of completely random measures, we fix e to a small value. 
However, it is worth mentioning that it is possible to elicit a prior for e, but the computa¬ 
tional cost might greatly increase for some p. 

The main achievements of this works can be summarized as follows: first we show that, 
for e going to zero, the finite dimensional e-approximation of homogeneous normalized 
completely random measures converges to its infinite dimensional counterpart, and com¬ 
pute its prior moments (Sections 3 and 4). Then we provide a Gibbs sampler for the e 
-approximation hierarchical mixture model (Section 5). Section 6.1 is devoted to the intro¬ 
duction of the normalized normalized Bessel random measure, and some of its properties; 
on the other hand. Section 6.2 discusses an application of the e-Bessel mixture models to 
both simulated and real data. Section 7 defines the linear dependent e-NGGs, and consider 
linear dependent e-NGG mixtures to fit the AIS data set. To complete the set-up of the paper. 
Section 2 is devoted to a summary of basic notions about homogeneous NRMIs, and Section 
8 contains a conclusive discussion. 

2 Preliminaries on homogeneous normalized completely random 
measures 

Let us briefly recall the definition of a homogeneous normalized completely random mea¬ 
sure. Let 0 C Ift m for some positive integer m. A random measure // on 0 is completely ran¬ 
dom if for any finite sequence B\, E> 2 , ■ ■., of disjoint sets in 13(0), p(Bi), p(B 2 ),..., p(Bk ) 
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are independent. A purely atomic completely random measure is defined (see Kingman, 
1993, Section 8.2) by p(-) = Ylj>i Jj (•)/ where the {{Jj, +])}f>\ are the points of a Poisson 
process on M + x 0. We denote by v{ds , dr) the intensity of the mean measure of such a Pois¬ 
son process. A completely random measure is homogeneous if v{ds, dr) = p(s)dsnPo(dT), 
where p(s) is the density of a non-negative measure on M + , while kPq is a finite measure 
on 0 with total mass k > 0. If p is homogeneous, the support points, that is {r ; }, and the 
jumps of p, {Jj}, are independent, and the r/s are independent identically distributed (iid) 
random variables from Pq, while { Jj} are the points of a Poisson process on M + with mean 
intensity p. Furthermore, we assume that p satisfies the following regularity conditions: 

/*+OO /»+OO 

(1) / min{l, s}p(s)ds < oo and / p(s)ds = + oo, 

Jo Jo 

so that, if T := p(0) = Ylj>i Jj' 1P(0 < T < +oo) = 1. Recall that the distribution of T is 
uniquely determined by its Laplace transform, given by: 

r+oo 

(2) E(e -Ar ) = exp {— k / (1 — e~ Xs )p(s)d-s}, A > 0. 

Jo 

Therefore, a random probability measure (r.p.m.) P can be defined through normalization 
of p: 


( 3 ) 




+00 

E % • 


3 =1 


We refer to P in (3) as a (homogeneous) normalized completely random measure with pa¬ 
rameter (p, kPq). As an alternative notation, following James et al. (2009), P is referred as a 
homogeneous normalized measure with independent increments. The definition of normal¬ 
ized completely random measures appeared in Regazzini et al. (2003) first. An alternative 
construction of normalized completely random measures can be given in terms of Poisson- 
Kingman models as in Pitman (2003). 


3 ^-approximation of normalized completely random measures 

The goal of this section is the definition of a finite dimensional random probability measure 
that is an approximation of a general normalized completely random measure with Levy's 
intensity given by u(ds, dr) = p(ds)Kp)[dT), introduced above. 

First of all, by the Restriction Theorem for Poisson processes, for any e > 0, all the 
jumps {Jj} of p larger than a threshold £ are still a Poisson process, with mean intensity 
7 S (s) := Kp(s)I( £j+0O )(s). Moreover, the total number of these points is Poisson distributed, 
i.e. AL T’o(Ae) where 

r+oo 

A £ := 7 e(M + ) =K p(s)ds. 
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Since A £ < +oo for any e > 0 thanks to the regularity conditions (1), N e is almost surely 
finite. In addition, conditionally to N e , the points { J\,..., Jj \ e } are iid from the density 


(4) 


Pe(s) 


7e(g) 

a £ 


A e 


^(e,+oo)(®)j 


thanks to the relationship between Poisson and Bernoulli processes; see, for instance. King- 
man (1993), Section 2.4. However, in this case, while PfX^i Jj < °°) = 1/ the condition 
on the right hand side of ( 1 ) is not satisfied, so that P(X/J=i Jj = 0) > 0, or, in other terms, 
P (N e = 0) > 0 for any e > 0. For this reason we consider N e + 1 iid points {Jo, J \,..., Jn s } 
from p £ and define the completely random measure p £ {-) = J?A, (■)/ as well as its 

normalized counterpart: 


N e 


(5) 


p =(') = E p A 

3=0 


Ji 


N e 

= y —a . 

T 3 
3=0 ±e 


(•), 


where 77 = J2f=oJj' T j ~ Po> { T j} and {Jj} independent. We denote P £ in (5) by e- 
NormCRM and write P £ ~ £—NormCRM(p,nPo). Whenp £ (s) = l/(w°T(— a, uj£))s~ a ^ 1 e~ uls , 
s > e, P £ is the e-NGG process introduced in Argiento et al. (2015), with parameter (a, n, Pq), 

0 < a < 1 , k > 0 . 

Both the infinite and finite dimensional processes defined in (3) and (5), respectively, 
belong to the wide class of species sampling models, deeply investigated in Pitman (1996), 
and we use some of the results there to derive ours. Let ( 6 1 ,..., 0 n ) be a sample from (3) or 
(5) (or more generally, from a species sampling model); since it is a sample from a discrete 
probability, it induces a random partition p n := {C\ ,..., 67 , } on the set N n := {1,..., n} 
where Cj = {i : 0 t = 0* } for j = I..... k. If 4{C r = ri t for 1 < i < k, the marginal law of 
(0i,...,0 n ) has unique characterization: 

k 

C{p n ,0l,... ,91) = p(ni,... ,n k )Y[£(9j), 

3 = 1 

where p is the exchangeable partition probability function (eppf) associated to the random 
probability. The eppf p is a probability law on the set of the partitions of N n . The following 
proposition provides an expression for the eppf of a general e-NormCRM. 

Proposition 1. Let (m ,..., n^) be a vector of positive integers such that Yli=i n * = n ■ Then, the 
eppf associated with a P £ ~ e-NormCRM(p, kPq) is 



u n 1 {k + A £ , u ) (Ae u _ Ae) 
r(n) A £ 



Ks ni e us p(-s)ds 


( 6 ) p £ (m . ,n k ) 


du 
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ivhere 

( 7 ) 


r+oo 


A-e,u • — 


:= K 


S p(s)ds, 


u > 0. 


Proof. We have 


+oo 


A*« 


p E (ni,...,n k ) = Y Pe{ni,... ,n k \N £ )^-je 


-A e 


N e =0 


since N e ~ Poi( A e ). Then, equation (30) in Pitman (1996) yields 

p £ (m,...,n k \N £ ) jv e +i}(*0 Y 

ji,-,jk \i=l / 

where the vector {j \,..., j k ) ranges over all permutations of k elements in {0,, N £ }. Then, 
using the gamma function identity, 1/T e n = J 0 +o ° 1/T (n)u n ~ 1 e~ uTe du, we have: 


n k JTti 

p £ (m,..,n k \N e )= i {li „ m>Ne+ iy(k) y 


*=i 


= I 


JV £ +1 


}(fc) ^ 


r+OO 


du 


31 


1 " r+oo 

r+oo \ 

II / e- J ‘ u Pc(Jj)dJj 

31,-,3k} ° / 


Now, by the definition of p £ in (4) and adopting the notation p(s) := np{s), it is straightfor¬ 
ward to see that 


Pe{ni,..,n k \N £ ) = I{i,...,jv e+ i}(fc) Y 


P+OO 


du 


1 


31,—-,3k 


r(n) 


■n 

1=1 ' 


r+oo 


A, 


/ +oo 

3<t\3l,-,3kt 


-JjU P(Jj 


\ dJj 

A, J 


By (8), we have 


+oo 


p £ (n 1 ,...,n k ) = Y hi,-,N £ +i}(k) Y 


r+oo 


du 


u"” 1 1 


N e = 0 


31,-,3k 


o \T(n) Af ^ 


k P+OO 

II J 


^N e -\-l—k 


n 

.+ {+..//, 1 


p+oo 


+oo 




—A e /*+oo f „,n-l / p+oo 


A^ e =0 


y 0 


h 


e~ JjU PiJjWj 

N e +l-k 


Yi e -^ 

N £ \ 


e~ JjU p(J j )dJ j 


/ K r+oo \ 

x E (11/ jdu. 
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Denoting by N na := N e + 1 — k the number of non-allocated jumps, we get 


Pe(ni,...,n k ) = 


,+oo j' u n -1 g—A e g jV £ + 1 


\r(n) A e 2^ o (N £ + l-k)\ {1 ’->^ +1 


}(*) 


a +oo \ iV e +l—fc / k i>+oo \ 'j 

e~ JjU p(J j )dJ j j f J] J \du 


r+oo ^n—1 g—A e ^ /*+oo 


+oo 


/o r (n) A s 


n / d^e~ Jj i u p(Jj i )dJj i £ Afr^i^Ir 


N na =0 


V £,U at | 1 {l,...,A r e + l 

j V TiV7 • 


}(fc) du. 


Since the last summation adds up to e Ae -" (A s . n + k), the p e (ni,..., n k ) and the right hand- 
side of (6) coincide. □ 


A result concerning the eppf of a generic normalized (homogeneous) completely random 
measure can be readily obtained from Pitman (2003), formulas (36)-(37): 

r+oo „,n— 1 , (r+oo \ 

(9) p( ni , ...,n k ) = —e+ °° {e ~ US ~ lMs)ds ( ]I J o Ks ni e~ us p(s)dsj du. 

Now we are ready to show that the eppf of (5) converges pointwise to that of the corre¬ 
sponding (homogeneous) normalized completely random measure (3) when e —>• 0. 

Proposition 2. Let p £ f) be the eppf of a £—NormCRM(p, kPq). Then for any sequence n\,..., n k 
of positive integers with k > 0 and ffn=i n i = n > 

(10) lim p e (n \, ...,n k ) = p 0 +i ,.. .,n k ), 

£->0 

where p o(-) is the eppf of the NormCRM(p, kPq) as in (9). 


Proof. By Proposition 1 


where 


r+oo 

Pe{ni, ■ ■ ■ ,n k ) = / fe(u;ni,... ,n k )du 
Jo 


(11) fe(u-ni,...,n k ) = ^— e (A s ,u f ns ni e us p(s)ds, u> 0 

r W A £ f = , Je 

On the other hand, the eppf of a NormCRM(p, kPo) can be written as 


f+oo 


Po(ni,...,n k ) = / /o(u;ni,... ,n k )du, 


where 


u n ~ 1 f r+oo 'i f r+oo 

fo(u] ni,, n k ) = p^y exp y (e _ “ s - l)p(s)ds| J n.s ni e~ us p(s), u> 0. 
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We first show that 


(12) limf £ (u;n 1 ,...,n k ) = f(u;n 1 ,...,n k ) for any u > 0. 

£—►0 


In particular, we have that 


O+OO 


lim 

£—^0 


s ni e- us p(s)ds 



s ni e- us p(s)ds 


and 


r»+oo 


lim e Ae ’ u Ae = exp | k J (e us — l)p(s)ds| , 

being this limit finite for any u > 0. Using standard integrability criteria, it is straightfor¬ 
ward to check that, for any u > 0, linv^o A £;U = lim e _>o A e = +oo and they are equivalent 
infinite, i.e. 


k + A £ u A £ u 1 

lim--- = lim - = 1. 

£- ^0 i\.£ £- ^0 i\.£ 


We can therefore conclude that (12) holds true. 

The rest of the proof follows as in the second part of the proof of Lemma 2 in Argiento 
et al. (2015), where we prove that (i) lim £ ^ 0 Xcen„ Ve{n\, ■ ■ ■, n k ) = 1; (ii) lim inf e _».o Pe(«i, ■ ■ ■, n k ) 
Po(n\,..., n k ) for all C = (Ci, ..., C k ) G IT n , the set of all partitions of {1,2,..., n}; (in) 
Xcgn„ To(ui,..., n k ) = 1. By Lemma 1 in Argiento et al. (2015), equation (10) follows. 

□ 


Convergence of the sequence of eppfs yields convergence of the sequences of e-NormCRMs, 
generalizing a result obtained for e-NGG processes. 

Proposition 3. Let P £ be a e-NormCRM(p , nPo),for any e > 0. Then 

P e 4P(!S£->0, 

where P is a NormCRM(p , kPq). Moreover, as e —t +oo, P £ -4 S T0 , where r 0 ~ Pq. 

Proof Since P £ is a proper species sampling model, p £ defines a probability law on the sets 
of all partitions of {1,..., n}, for any positive integer n; let (TVf,.... Nf) denote the sizes of 
the blocks (in order of appearance) of the random partition GV.n defined by p £/ for any e > 0. 
The probability distributions of {(A r f ,... ,N^),e > 0} are proportional to the values of p £ 
(for any e > 0) in (2.6) in Pitman (2006). Hence, by Proposition 2, for any k = 1 , ... ,n and 
any n, 


(Aff, • • •, N k ) —> (A} 5 ,..., N®) as e —> 0, 
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where (N ®,..., N®) denote the sizes of the blocks of the random partition C £ , n defined by 
Po, the eppf of a NormCRM(p, kPo) process. By formula (2.30) in Pitman (2006), we have 

d 


n 


n—>•+oo 


> (p?) 


£—^0 


Nl 

n 


n—>•+oo 


> (Pi) 


where PJ and Pj are the j-th weights of a c-NormCRM and a NormCRM process (with 
parameters (p, kPq)), respectively. We prove that 


Ej>o Pf^' 


•' ' •' n—H-oo 


* 'Dj]> 0 Pj ^ T j ’ 


where to, ti, 12 ,... are iid from Po and this ends the first part of the Proposition. 

Convergence as e —>• +00 is straightforward as well. In fact, when e increases to + 00 , 
there are no jumps to consider in (5) but the extra Jo, so that P £ degenerates on d TQ . □ 


Let 6 = ( 61 ,..., 9 n ) be a sample from P e , a e-NormCRM(p, kPo ) as defined in (5), and 
let d* = ( 6 *,... ,0%) be the (observed) distinct values in 6 . We denote by allocated jumps of 
the process the values P;*, P;*,..., Pi* in (5) such that there exists a corresponding location 
for which T t * = 0 *,i=l,...,k. The remaining values are non-allocated jumps. We use the 
superscript ( na ) for random variables related to non-allocated jumps. We also introduce the 
random variable U := T n /T e , where r„ ~ gamma(n, 1), being T n and T e independent. 


Proposition 4. If P e is an e—NormCRM(p, hPq), then the conditional distribution of P s , given 6 * 
and U = u, verifies the distributional equation 

k 

PU-) = w+ (1 - W) r, w * ; (0 

3 = 1 


where 


1. pDfD (•), the process of non-allocated jumps, is distributed as a e—NormCRM(e u 'p {-), kPq), 
given that exactly N na jumps of the process were obtained, and the posterior law of N na is 


A-e,u 

k + A £jU 


Pi(A £ jU ) + 


k + A £)U 


-Po(A-e,v 


being A £)U as defined in (7), and denoting VfX) the shifted Poisson distribution on {i, i+ 1, * + 
2,...} with mean i + A, i = 0,1; 


2. the allocated jumps {p[ a \ ..., pjf' 1 } associated to the fixed points of discontinuity G* = 
(PJ,..., 01) of P* are obtained by normalization of J? i e~ ujj e~ ujj p(Jj)I( e i+00 )(Jj), 

for j = 1 • • •, k; 
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3. Pe™u\-) and { j["\ • • • , J^} are independent, conditionally to l* = (/*,..., If), the vector of 
locations of the allocated jumps; 

4. w is defined as 0 when N na = 0, otherwise w = T £tU /(T £ ^ u + Ylj =l )■ T £i „ is the total 

sum of the jumps in representation of P £ ff\-) as in (5); 

5. the posterior law ofU given G* has density on the positive real given by 

A i U k r +oo 

fu\9*(u\0*) oc u n - 1 e A '‘ u ~ A ' e ’ u I] / Acs ni e- us p(s)ds, u > 0 . 

This proposition is the "finite dimensional" counterpart of Theorem 1 in James et al. 
(2009). 

Proof. The first steps of the proof are the same as in the proof of Proposition 2 in Argiento 
et al. (2015); in particular, the joint law of 6, u, P £ , £(G, u, P £ ), is as in (16) in Argiento et al. 
(2015). The conditional distribution of P £ , given U = u and 0, is as follows: 

(13) jC(P £ \u,G) = £{t,J,N £ \u,G) = jC(t,J\N £ ,u,G)jC{N £ \u,G), 


where the second factor in the right hand side is proportional to 


C(N £ ,u, G) = J dJo ... dJN e dro ... dTN e C(r, J, N e ,u, G) 

£ {[n [ PM*)Mni)dJ n dr n 

I!.5 l 4= 

f e~ uJj p £ (Jj)Po(Tj)dJjdr 

n* J 


( r k n, —uJi* —uJj 

£ n / g 

ip-,i* k 


1 v n - l e~^ A ^‘ 


T(n 


N e \ 


e ~ aui te~ UJl *i p (J lt ) 

X ~ 


dJi*P 0 (9i* 


n 




\ \ u1X 1 -A e ^fjt_ 
A £ )J T(n) N e l 


,71— 1 


Af e 


U J. v £ 

rR e X 


£ (AnX 


i=1 


u 


- 1 _A e A ^ +1_fc (A^e + 1) 


T(n) 


A, (N e + 1 - 




r+OO \ ^JV e + l —fe 

^e-™ P (s)ds) 

+OO 

ns ni e~ us p(s)ds ). 


We have already introduced in this paper N na = N e + 1 — k, the number of non-allocated 
jumps. Of course, the conditional distribution £(N e \u, G) in (13) is identified by C(N na \u, G), 
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which can be derived as 


(14) 


£(N na \u, 6) (x £(N na , u, 6) oc A ; 

A, 


N na ( N na + k) 


na 

£,U 


N na \ 


k 


„-A e.ul ll£ ’ u A N na -1 , * \N na nr 

Ke ' CJV M - i)! A «-“ + ivj A '.“ )*<«"“«> 


A„ 


oc 


-'Pi (Nna, A £)U ) + 


k 


-Vo(N na ] A £jU ). 


Ku+k v ’ ' Ku+k 

On the other hand, the first factor in the right hand side of (13) can be computed by intro¬ 
ducing l* = (l *,..., l* k ), the vector of indexes of the allocated jumps and by observing that 
the augmented right hand side of (13) 

N na -\-k —1 

C(J,T,l*\Nna,U,0) = IT Pe(Jj)Po(Tj)e 

1 ‘1 k l k , 


, uJ i 


3=0 


(15) 


11' 

Ci=1 

1 


rii 


np(Ji* 


e uJ lV _r^ )p 




n 


e -uj 0 ^Mp Q{Tj) 


\ 


N e +1 


(iL 7 ? 6 UJl ^p(Jit)^Po(ri t )) [ n e~ uJj np(Jj)Po{jj ) j . 

v *=i ! 1 J W/ 


The first factor in the last expression refers to the unnormalized allocated process: the support 
is 9*. This shows point 2. of the Proposition. 

Therefore, the conditional distribution of P e is proportional to the following expression: 



( IT e~ uJj Kp(Jj )Pq (tj ) j 

H=1 ' 1 7 



+1 ~ k —a (N £ + 1) 


X A, 6 (N e + 1- 




s ni ne~ us p(s)ds. 


This yields points l.,3. and 4. of the Proposition. 

To show point 5., we need to integrate out N e from C(N e ,u, 6 ); we have: 


+oo 


+°° 7/ n— 1 ^Ne + l — k 7U 1 k r+oo 

C{u\9) oc Y, P( N e,u,d) = Y Y(nj e ~ Ae \ (N e + 1 - k)\ J[l p ^ ds 


N e =0 


N e =0 


„,n -1 +°° \Nna AT I U / k f+OO 

-»• V %^%±A(n / Ks ni e~ us p(s)ds 


T(n 


N na = 0 


A e A t na \ 


T(n 

This ends the proof. 




A, 


IT / Ks ni e us p(s)ds 


i= 1 ' 


□ 
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4 Prior moments of P £ 


Before deriving the first two moments of P £ , let us mention that the expect value and vari¬ 
ance of N e , the number of jumps considered in the approximation P £ , depend on the prior 
of e. Of course, if e is assumed fixed, IEfAO) = Var(Af £ ) = A £ < +oo, while, if e is random, 
then 

E(JV e ) = E((JV e |e)) = E(A £ ), Var(AT £ ) = Var(A £ ) + E(A e ). 


In this case, the mean and variance of N £ are not necessarily finite; see, for instance. Ta¬ 
ble 2 in Argiento et al. (2015), where P £ is the e-NGG process, and for some values of its 
hyperparameters the mean or the variance of N e are infinite. 

First of all, observe that 


(16) 


{xi + 


+ «,.)”•= £ 

miH- =m 

>0 


m 


m i,, 


,m N > 


N* 

n 

3 =1 


lib ^ 

= E'fi.«;>(*% 


E 

n\-\ - 1 -rik=m 

rij= 1,2,... 



n- 

'3 


where N* = N e + 1, Xj = 1 for all x 3 > 0, and the last summation is over all positive 
integers, being (16) the multinomial theorem. The second equality follows straightforward 
from different identifications of the set of all partitions of m (see Pitman, 2006, Section 1.2). 
Therefore, for any B € 13(0), m = 1,2,..., we have (here, instead of P, and To as in (5), there 
are Pn* and tn*): 


N* 


E(P e (B) m ) = E j E ( (J2 P 3 6 rj( B )) m \N £ 

3=1 


( 


= E 


/ 


E 


\ 


E 


rn 


miH- \-m N *=m v ’ Jv e' j=l 


N* 


H(P 3 5 Tj (B)ry\N £ 


V V >0 


J 


( 


= E 


= E 


/ 


E 


1 


\ 


E%.E vni n . 

K - ni+ ...+n k =m ' > n fc 

nj= i,2,... 


k=\ 


m 


E ) \n £ 

<31, —Jk i= l 


( 


\ 




k =1 


niH-| -nk=m 

■n j = 1,2,... 


m 


n\ , • • •, fik 


E E (II P ji\ N z) IT ( B )\ N e 


31,-,3k i= l 


1=1 
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= E 


m ^ 

£*e.«?)(*% £ 


k= 1 


niH-(- nk=m 

n.j =1,2,... 


m 


111, . . • , Tlk 


Pe(ni, ■ ■ ■ ,n k )(P 0 (B)) k 


We identify this last expression as 

E^P 0 (H) fc P(/i m = fe|iV e ) N ) , 


\k =1 


where K m is the number of distinct values in a sample of size in from P £ . Hence, we have 
proved that 


E (P £ (B) m ) = E ( E(P 0 (B) k ™\N £ )) = E ( P 0 (B ) Km ) . 


In particular, when m = 2, K m assumes value in {1, 2}, and the probability that K 2 = 1 is 
the probability that, in a sample of size 2 from P e , the samples values coincide, i.e. p £ ( 2). 
Therefore 

E(P e (B) 2 ) = P 0 (B)p £ (2 ) + (P 0 (B)) 2 (1 -p e (2)), 

and consequently 

(17) Var (P £ (B)) = P 0 (B)p £ (2) + P 0 (B) 2 ( 1 - p £ ( 2)) + P 0 (B) 2 = p £ (2)P 0 (B) (1 - P 0 (B)). 


Analogously, suppose that B i, B 2 € 13(0) are disjoint. Therefore 


N* N* 

E(P e (B 1 )P e (B 2 ))=E M £%<£!)£ PiS n (B 2 )\N e 

.1=1 /=i 


)) 


/ 


= E 


= E 


= E 


/ 


E 


N* 


\ 


Y PjS Tj (B 1 n B 2 ) + Y ^) S n ( p 2 )\Ne 

3 = 1 


1+3 


\ 


\ 


\ ZA? 
\H=i,-,A 

/ 


P 0 {B 1 )P 0 {B 2 ) Y ^ P 3 P l\ N e) 


V 


1+3 

j-l I.V 


= Po(Pi)P 0 (P 2 )p e (l,l). 


y 


The general case when /i| and /i 2 are not disjoint follows easily: 

E(P £ ( J B 1 )P e (5 2 )) = E ((P e (Pi n S 2 )) 2 ) + E (P e (Pi \ P 2 )P £ (Pi D P 2 )) 

+ E (P £ (P 2 \ Pi)P e (Pi n P 2 )) + E(P e (P! \ B 2 )P £ (B 2 \ Pi)), 
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where now the sets are disjoint. Applying the result above we first find that 

E(P e (S!)P e (S 2 )) = Pe(2)P 0 (B 1 n B 2 ) + (1 - p £ (2))P 0 (B 1 )P 0 (B 2 ), 

and consequently: 

Cov((P e (B 1 ), P £ (B 2 )) = p £ ( 2) (P 0 (Si n P 2 ) - P 0 (Si)fb(S 2 )) • 


5 e-NormCRM process mixtures 

Among the wide range of applications in which discrete random probability measures are 
exploited, hierarchical mixture models, dating back to Lo (1984), are frequently used when 
dealing with various data structures. Hence, as argued in the Introduction, their role is 
becoming more and more central in modern Bayesian Nonparametrics. We consider mix¬ 
tures of parametric kernels as the distribution of data, where the mixing measure is the 
e-NormCRM(p, kPq). The model we assume is the following: 

Yi\9i ~ d i = l,...,n 

Oi I P £ ~ P e , i = l,...,n 

(18) 1 

P £ ~ £ — NormCRM ( p , kPq), 

£ ~ vr(e), 

where /(•: 9j) is a parametric family of densities on ¥ C M p , for all 0 G 0 C M m . Remember 
that /}) is a non-atomic probability measure on 0, such that E(P £ (A)) = P\i(A) for all A € 
B(&) and all e > 0. Model (18) will be addressed here as e—NormCRM hierarchical mixture 
model. It is well known that this model is equivalent to assume that the Y/s, conditionally 
on P £ , are independently distributed according to the random density 

, N e 

h(y)= / f(y,8)P £ (dd) = YP J 

Je j=o 

In particular, we are able to build a blocked Gibbs sampler to update blocks of parameters, 
which are drawn from multivariate distributions. 

The parameter is (P e , e, 6), but we use the augmentation trick prescribed by the posterior 
characterization in Proposition 4, so that the new parameter is (P £ ,s, 0, u); the joint law of 
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data and parameters can be written as follows: 

n 

C(Y , e , u, P e ,e) = C(Y\0, u, P £ ,£)C(0, u, P £ \e)C(e) = f(Y t : 0i)C(9, u, P £ \e)7 r(e) 


2—1 


(19) 


u 


n— 1 


r(n) 




3=0 


i&C 1 




x J % n 7r(£) 


ieCfc 


where we used the hierarchical structure in (18). The Gibbs sampler generalizes that one 
provided in Argiento et al. (2015) for e-NGG mixtures. Description of the full-conditionals 
is below, and further details can be found in the Appendix. 


1. Sampling from C(u\Y, 6 , P £ ,e): from (19) it is easy to see that the factors depending 
on u identify this full-conditional as gamma with parameters (n, T e ), like the corre¬ 
sponding prior. 


2. Sampling from C(0\u,Y, P s ,e): each 0 lr for i = 1,..., n, has discrete law with support 
{t 0 , n,..., T Ne }, and probabilities P(0j = tj) oc J } f{Y l \ tj). 

3. Sampling from C{P £ , e\u, 0. Y): this step is not straightforward and can be split into 
two consecutive substeps: 

3.a Sampling from C{e\u, 6, Y): see the Appendix. 

3.b Sampling from C(P e \e, u. 0. Y)\ via characterization of the posterior in Propo¬ 
sition 4, since this distribution is equal to C{P e \s, u, 6). To put into practice, we 
have to sample (i) the number N na of non-allocated jumps, (ii) the vector of the 
unnormalized non-allocated jumps J^ na \ (Hi) the vector of the unnormalized allo¬ 
cated jumps J^ a \ the support of the allocated (■ iv ) and non-allocated (v) jumps. See 
the Appendix for a wider description. 


Remember that, when sampling from non-standard distributions, Accept-Reject or Metropolis- 
Hastings algorithms have been exploited. 


6 Normalized Bessel random measure mixtures: an application to 
density estimation 

In this section we introduce a new normalized process, called normalized Bessel random 
measure, corresponding to a specific choice for the intensity function p(-). Section 6.1 de- 
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scribes theoretical results: in particular, we show that this family encompasses the well- 
known Dirichlet process. Then we fit the mixture model to synthetic and real datasets in 
Section 6.2. Results are illustrated through a density estimation problem. 


6.1 Definition 


Let us consider a normalized completely random measure corresponding to mean intensity 

p(s;uj) = -e _a;s /o(s), s > 0, 
s 

where ui > 1 and 


/,(») = £ ,1 /2 | 

m= 0 v ' 

is the modified Bessel function of order v > 0 (see Erdelyi et al., 1953, Sect 7.2.2). It is 
straightforward to see that, for s > 0, 

1 + °° 1 

( 20 > = r"“ + E 2^!p s2 ”- le "“' 

m= 1 v ’ 

so that p is the sum of the Levy intensity of the gamma process with rate parameter uj and 
of the Levy intensities 

(21) Pm(s-,uj) = 22m( 1 m , )2 g 2m - 1 e- a,a , s>0, m = l,2,... 

corresponding to finite activity Poisson processes. It is simple to check that (1) holds. Hence, 
following (3) in Section 2, we introduce the normalized Bessel random measure P, with parame¬ 
ters (u, k), where ui > 1 and n > 0. Thanks to (20) and the Superposition Property of Poisson 
processes, in this case, the total mass T in (3) can be written as 

+oo 

(22) T = T g +Y^ T m , 

m= 1 

where Tq, T), T2, ... are independent random variables, Tq being the total mass of the 
gamma process and T m the total mass of a completely random measure corresponding 
to the intensity u. m {ds,dr) = p m {s)dsKP{ i (dT). In particular, Tq ~ gamma(K,w), while 
T m = Ylf=o > w h ere ~ Poi(i\T{2m)/({2(jj)‘ 2rn (m\) 2 )), and are the points of 

a Poisson process on M + with intensity np m . By this notation we mean that T m is equal to 0 
when N m = 0, while, conditionally to N m > 0, ~ gamma(2m,u). We can write down 
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the density function of T, via (2): 


V’(A) := — log ^E(e AT )^ = K J (1 — e Xs )p{s;uj)ds 

0 /*+OO ] +oo -i /*+oo 

(1 -e- Xs )-e~ us ds+y _ , / (1 — e _As )s 

o 2 2 -(m!) 2 7o 


- A ^„2m-l e - WS& 


= « ^log 
= « ^log 

= AC log 


T(2m) 


u; + A\ ^ T(2m) ^ 

a; J 2-/ 2 2m (m\) 2 u> rn ^ 2 2rn (m\) 2 (ui + A) 

' m— 1 v 7 m=l 7 


cc + A 


cc 


1 1 


2 2 


1 1 


-log O + o V 1 ~ “3 + log n + nt 1“ 




2 2 


(w + A) 5 


LO 


+ A + \f [ui + A) 2 — 1 




+ Vw 2 -1 




The same expression is obtained when T frit) = k{uj + \/lo 2 — 1) K ^ I K (t), t > 0 (see 
Gradshteyn and Ryzhik, 2007, formula (17.13.112)). Observe that, when lo = 1, fx is called 
Bessel function density (Feller, 1971). By (9), the eppf of the normalized Bessel random mea¬ 
sure is: 


(23) 


where 


p B (n 1 ,...,n k ]io,K) = n k 

J o 


+ 00 yU— 1 


LO 


+ Vlo 2 -1 


1 


r(n) ycc + u + ^ (lo + u) 2 — 1 J (u + u) T 

1 r . . (71 j Ti j -|“1 1 

xri r ( n i) 2Fi 


3 =1 


('U + Lo ) 2 


du, 


TP ( \ SD ( a l)m ( a2 )m 1 / \m , T(a + m) 

2 F 1 (ai,a 2 -,r,z ) := 2 ^ -73-“37 (*) , with (a) m := 


m=0 


(7)r 


m\ 


T(a) 


is the hypergeometric series (see Gradshteyn and Ryzhik, 2007, formula (9.100)). 

The following proposition shows that the eppf of the normalized Bessel random measure 
converges to the eppf of the Dirichlet process as the parameter to increases. 

Proposition 5. Let (m,... ,n k ) be a vector of positive integers such that Yli=i = n > where 
k = 1 ,n. Then, the eppf (23), associated with the normalized Bessel random measure P with 
parameter (co, ac), lo > 1, ac > 0, and mean measure Pq, is such that 


lim p B (ni,n k ; w, ac) = p D (ni,n k ; ac), 

uj— >-+oo 

where pD(ni, • • •, n k ; ac) is the eppf of the Dirichlet process with measure parameter kPq. 
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Proof. The eppf of the Dirichlet process appeared first in Antoniak (1974) (see Pitman, 1996); 
anyhow, it is straightforward to derive it from (9): 


PD(m,... ,n k ;n) = / 

Jo 


+oo ^n-1 


^n« , r(n -’> du 

(U + u) n 3 

3 = 1 


r+oo u n— 1 


T(ra) \co + uJ (u T w) n ^ 


1 k 


T(/tTn) ' . -f- 
.7 = 1 


n r <’ 


where the last equality follows from formula (3.194.3) in Gradshteyn and Ryzhik (2007). By 
definition of the hypergeometric function, we have 


1 < 2-^1 


n 1 rij + 1 1 \ _ 

T ,_ 2~ ;1 ’ (u + ujY ) ~ 2Fl 


m n J + 1 .i. 1 
2 ’ 2 ’ ’ J- 


Moreover 


u + vka 2 — 1 _ uj 1 + — 1/ta 2 

(u + 0j) + y/l(u + Lu) 2 - 1) U + U 1 + y/l - l/(ti + w )2 


1 + -\/l ~ l/^i 2 < 1 + \/l ~ 1/ia 2 ^ 1 

2 _ 1 + y 7 ! - l/(tt + w) 2 _ 


so that 


1 + \J 1 — 1/uP 


PD(ni, ...,n k -K)< p B (ni,. .. ,n k ;uj,K ) 


n ^ Tlj Tlj 1 1 \ , , 

2G1 I Y J 2 — ;1; ^2 JPz?(ni,...,n fe ;K). 
f=i ' “ ^ 7 

The left hand-side of these inequalities obviously converges to pd(m, • • -, n k : n) as uj goes to 
Too. On the other hand. 


"j + 1 .l. 1 
2 ’ 2 ’ ’ ca 2 


1 clS UJ —V “bOO, 


thanks to the uniform convergence of the hypergeometric series 2 1 7 ! (^f, " J 2 ' : 1; z) on a disk 
of radius smaller that 1. We conclude that, for any ni,... ,n k such that n\ + • • • T n k = n, 
k = 1,. ,. ,n, and any k > 0, 

lim p B (ni,... ,n k -,uj,K) = p D (ni,... ,n k \ k). 

LJ —>-+00 


Since the eppf is the joint distribution of the number K n of distinct values and corre¬ 
sponding sizes N\,... ,N k (see equation (30) in Pitman (1996)) in a sample of size n from the 
normalized Bessel completely random measure, by marginalization we obtain 


P(K n = k) = - j: 


77/15 • • • 5 ^ k 


P B (jii, ..., n k ), k = 1,... ,n, 
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where the sum is over all the compositions of n into k part, i.e., all positive integers such that 
n\ + ■ ■ ■ + ilk = n - Unfortunately, we were not able to simplify further this last expression, 
because of the summation of the hypergeometric functions 2 F 1 occurring in the analytic 
expression (23) of pb- Since the number of partitions of n items in k blocks can be very 
high (it is given by the Stirling number S(n; k) of the second kind) and the evaluation of 
2 -F 1 computationally heavy, we prefer to use a Monte Carlo strategy to simulate from the 
prior of K n . The simulation strategy is also useful to understanding the meaning of the 
parameters of the normalized Bessel random measure: k has the usual interpretation of the 
mass parameter, since, when fixing u, E (K n ) increases with k. On the other hand, the effect 
of oj is quite peculiar: decreasing uj (thus drifting apart from the Dirichlet process), with k 
fixed, the prior distribution of K n shifts towards smaller values. However, when E (K n ) is 
kept fixed, the distribution has heavier tails if uj is small (see Figures 1 and 3 (a)). 



Figure 1: Prior distribution of K n under a sample from e-NB process with e = 10 - 6 ,w = 1.05 
and several values for k, as reported in the legend. 


6.2 Application 

In this section let us consider the hierarchical mixture model (18), where the mixing measure 
is P £ , the e-approximation of the normalized Bessel random measure, as introduced above 
(here e-NB(w, kPq) mixture model). Of course, when e is small, this model approximates the 
corresponding mixture when the mixing measure is P; to the best of our knowledge, this 
normalized Bessel completely random measure has never been considered in the Bayesian 
nonparametric literature. By decomposition (22), we argue that this model is suitable when 
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the unknown density shows many different components, where a few of them are very spiky 
(they should correspond to Levy intensities (21)), while there is a folk of flatter components 
which are explained by the intensity (l/s)e~ ws of the Gamma process. For this reason, we 
consider a simulated dataset which is a sample from a mixture of 5 Gaussian distributions 
with means and standard deviations equal to {(15,1.1), (50,1), (20,4), (30, 5), (40, 5)}, and 
weights proportional to {10, 9,4, 5,5}. The histogram of the simulated data, for n = 1000, is 
reported in Figure 2. 

We report posterior estimates for different sets of hyperparameters of the e-NB mixture 
model when /(•; 0) is the Gaussian density on M and 6 = (p, a 2 ) stands for its mean and vari¬ 
ance. Moreover, Po(dp, da 2 ) = J\f(dp; y n , a 2 /kq) x inv — gamma(da 2 ; a, b); here Af(y n , v 2 /kq) 
is the Gaussian distribution with mean y n (the empirical mean) and variance a 2 /kq, and 
inv — gamma(d(j 2 ; a, b ) is the inverse-gamma distribution with mean b/(a — 1) (if a > 1). We 
set kq = 0.01, a = 2 and b = 1 as proposed first in Escobar and West (1995). We shed light on 
three sets of hyperparameters in order to understand sensitivity of the estimates under dif¬ 
ferent conditions of variability; indeed, each set has a different value of p £ ( 2), which tunes 
the a-priori variance of P £ , as reported in (17). We tested three different values for p £ ( 2): 
p e ( 2) = 0.9 in set ( A ), p £ ( 2) = 0.5 in set (B) and p £ ( 2) = 0.1 in set ( C ). Moreover, in each 
scenario we let the parameter 1/oj ranges in {0.01, 0.25, 0.5,0.75,0.95}; note that the extreme 
case of oj = 100 (or equivalently 1/a; = 0.01) corresponds to an approximation of the DPM 
model. The mass parameter k is then fixed to achieve the desired level of p £ ( 2). As far as 
the choice of e concerns, we set it equal to 1CT 6 : at the end, we got 15 tests, listed in Table 1. 
It is worth mentioning that it is possible to choose a prior for e, even if, for the p in (20), the 
computational cost would greatly increase due to the evaluation of functions 2 Pi in (23). 

We have implemented our Gibbs sampler in C++. All the tests in Sections 6 and 7 were 
made on a laptop with Intel Core i7 2670QM processor, with 6GB of RAM. Every run pro¬ 
duced a final sample size of 5000 iterations, after a thinning of 10 and an initial burn-in of 
5000 iterations. Every time the convergence was checked by standard R package CODA 
tools. Here, we focus on density estimation: all the tests provide similar estimates, quite 
faithful to the true density. Figure 2 shows density estimate and pointwise 90% credibility 
intervals for case A5; the true density is superimposed as dashed line. Figure 3 (a) and (b) 
display prior and posterior distributions, respectively, of the number K n of groups, i.e. the 
number of unique values among (9 1 ,..., 9 n ) in (18) under two sets of hyperparameters, Al, 
representing an approximation of the DPM model, and A5, where the parameter u> is nearly 
1. From Figure 3 it is clear that A5 is more flexible than Al: for case A5, a priori the variance 
of K n is larger, and, on the other hand, the posterior probability mass in 5 (the true value) is 
larger. 
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CM 



10 20 30 40 50 

simulated data 


Figure 2: Density estimate for case A5: posterior mean (line), 90% pointwise credibility 
intervals (shadowed area), true density (dashed) and the histogram of simulated data. 



(a) (b) 

Figure 3: Prior (a) and posterior (b) distributions of the number K n of groups for test A1 
(gray) and A5 (blue). 


In order to compare different priors, we take into account five different predictive good- 
ness-of-fit indexes: (i) the sum of squared errors (SSE) , i.e. the sum of the squared differ- 
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ences between the y t and the predictive mean E (Yi\data) (yes, we are using data twice!); 
(ii) the sum of standardized absolute errors (SSAE), given by the sum of the standardized 
error | yi — E( Y t \ data) |/ y/ Var( Y,\data ); (in) log-pseudo marginal likelihood (LPML), quite 
standard in the Bayesian literature, defined as the sum of log (CPOi), where CPOi is the 
conditional predictive ordinate of y lr the value of the predictive distribution evaluated at y„ 
conditioning on the training sample given by all data except yi. The last two indexes, (iv) 
WAIC\ and (v) WAIC 2 , as denoted here, were proposed in Watanabe (2010) and deeply 
analyzed in Gelman et al. (2014): they are generalizations of the AIC, adding two types of 
penalization, both accounting for the "effective number of parameters". The bias correc¬ 
tion in WAIC\ is similar to the bias correction in the definition of the DIC, while W AIC 2 is 
the sum of the posterior variances of the conditional density of the data. See Gelman et al. 
(2014) for their precise definition. Table 1 shows the values of the five indexes for each test: 
the optimal (according to each index) tests are highlighted in bold for the experiments ( A ), 
(B) and ( C ). It is apparent that the different tests provide similar values of the indexes, but 
SSE, indicating that, from a predictive viewpoint, there are no significant differences among 
the priors. However, especially when the value of k is small, i.e. in all tests A and B, a 
model with a smaller oj tends to outperform the Dirichlet process case (approximately, when 
oj = 100). On the other hand, the SSE index shows quite different values among the tests: 
it is well-known that this is a index favoring complex models and leading to better results 
when data are over-fitted. Therefore, tests with an higher value of k are always preferable 
according to this criterion. 

We fitted our model also to a real dataset, the Hidalgo stamps data of Wilson (1983) 
consisting of n = 485 measurements of stamp thickness in millimeters (here multiplied by 
10 3 ). The stamps have been printed between 1872 and 1874 on different paper types, see data 
histogram in Figure 4. This dataset has been analyzed by different authors in the context of 
mixture models: see, for instance, Izenman and Sommer (1988), McAuliffe et al. (2006) and 
Nieto-Barajas (2013). 

We report posterior inference for the set of hyperparameters which is most in agree¬ 
ment with our prior belief: the mean distribution is Po(dg, da 2 ) = Afidfi: y n . a 2 / kq) x inv — 
gamma(da 2 ; a, b) as before, and ko = 0.005, a = 2 and b = 0.1. The approximation param¬ 
eter s of the e-NB(w, kPI) random measure is fixed to 10 -6 ; on the other hand, in order to 
set parameters oj and k, we argue as follows: oj ranges in {1.05,5,10,1000} and we choose 
the mass parameter k such that the prior mean of the number of clusters, i.e. E (K n ), is the 
desired one. As noted in Section 6.1, a closed form of the prior distribution of K n is not 
available, so we resort to Monte Carlo simulation to estimate it. Table 2 shows the four 
couples of (oj, k) yielding E (K n ) = 7: indeed, according to Ishwaran and James (2002) and 
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Table 1: Predictive goodness-of-fit indexes for the simulated dataset. 


Test 

cu 

K 

SSE 

SSAE 

WAIC1 

WAIC2 

LPML 

A1 

100 

0.06 

6346.59 

811.16 

-3312.44 

-3312.55 

-3312.55 

A2 

4 

0.09 

5812.86 

810.43 

-3312.33 

-3312.42 

-3312.43 

A3 

2 

0.1 

6089.19 

810.99 

-3312.38 

-3312.47 

-3312.48 

A4 

1.33 

0.11 

6498.23 

811.29 

-3312.54 

-3312.62 

-3312.63 

A5 

1.05 

0.11 

5725.18 

810.39 

-3312.27 

-3312.36 

-3312.36 

B1 

100 

0.43 

5184.25 

809.61 

-3311.95 

-3312 

-3312.01 

B2 

4 

0.67 

5125.41 

809.7 

-3312.19 

-3312.25 

-3312.26 

B3 

2 

0.81 

4610.39 

809.42 

-3311.92 

-3311.98 

-3312 

B4 

1.33 

0.93 

4246.43 

809.07 

-3311.75 

-3311.83 

-3311.84 

B5 

1.05 

1 

4571.09 

809.08 

-3311.96 

-3312.05 

-3312.06 

Cl 

100 

1.56 

3707.5 

809.36 

-3311.73 

-3311.86 

-3311.88 

C2 

4 

2.67 

2194.1 

808.8 

-3312.02 

-3312.23 

-3312.26 

C3 

2 

3.64 

1223.86 

809.28 

-3312.62 

-3312.96 

-3312.99 

C4 

1.33 

5.29 

748.85 

808.7 

-3313.05 

-3313.51 

-3313.54 

C5 

1.05 

8.95 

685 

807.96 

-3312.9 

-3313.36 

-3313.38 
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(a) (b) 

Figure 4: Posterior inference for the Hidalgo stamp data for Test 4: histogram of the data, 
density estimate and 90% pointwise credibility intervals (a); posterior distribution of K n (b). 

McAuliffe et al. (2006) and references therein, there are at least 7 different groups (but the 
true number is unknown), corresponding to the number of types of paper used. For an in- 
depth discussion about the appropriate number of groups in Hidalgo stamps data, we refer 
the reader to Basford et al. (1997). Table 2 also reports prior standard deviations of K n : even 
if the a-priori differences are small, the posteriors appear to be quite different among the 4 
tests. All the posterior distributions on K n support the conjecture of at least seven distinct 
modes in the data; in particular. Figure 4 (b) displays the posterior distribution of K n for 
Test 4. A modest amount of mass is given to less than 7 groups, and the mode is in 11. Even 
Test 1, corresponding to the Dirichlet process case, does not give mass to less than 7 groups, 
where 9 is the mode. Density estimates seem pretty good; an example is given in Figure 4 
(a), with 90% credibility band for Test 4. 

As in the simulated data example, some predictive goodness-of-fit indexes are reported 
in Table 2: the optimal value for each index is indicated in bold. The SSE is significantly 
lower when u is small, thus suggesting a greater flexibility of the model with small values 
of c o. The other indexes assume the optimal value in Test 4 as well, even if those values are 
similar along the tests. 
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Table 2: Predictive goodness-of-fit indexes for the Hidalgo stamps data. 


Test 

UJ 

K 

E(K n ) 

sd(K n ) 

SSE 

SSAE 

WAIC1 

WAIC2 

LPML 

1 

1000 

0.98 

7 

2.04 

15.17 

384.1 

-713.12 

-713.96 

-714.12 

2 

10 

0.91 

7 

2.13 

12.85 

383.51 

-713.22 

-714.04 

-714.25 

3 

5 

0.92 

7 

2.18 

13.52 

383.68 

-713.52 

-714.3 

-714.4 

4 

1.05 

1.02 

7 

2.32 

11.12 

383.38 

-712.84 

-713.66 

-714.05 


7 Linear dependent NGG mixtures: an application to sports data 

Let us consider a regression problem, where the response Y is univariate and continuous, 
for ease of notation. We model the relationship (in distributional terms) between the vec¬ 
tor of covariates x = (x\,.... x p ) and the response Y through a mixture density, where 
the mixing measure is a collection {P x ,x € X} of e-NormCRMs, being X the space of all 
possible covariates. We follow the same approach as in MacEachern (1999), MacEachern 
(2000), De Iorio et al. (2009) for the dependent Dirichlet process. We define the dependent 
e-NormCRM process { P x . x € X}, conditionally to x, as: 

N e 

(24) 

3=0 

The weights Pj are the normalized jumps as in (5), while the locations 7 'j(x), j = 1,2,..., are 
independent stochastic processes with index set X and Pq x marginal distributions. Model 

(24) is such that, marginally, P x follows a e-NormCRM process, with parameter ip, kPq x ), 
where p is the intensity of a Poisson process on M + , n > 0, and Pq x is a probability on M. 
Observe that, since N e and Pj do not depend on x, (24) is a generalization of the single 
weights dependent Dirichlet process (see Barrientos et al., 2012, for this terminology). We 
also assume the functions x 7 j[x) to be continuous. 

The dependent e-NormCRM process in (24) takes into account the vector of covariates 
x only through 7 j{x). In particular, when the kernel of the mixture (18) belongs to the 
exponential family, for each j, 7 j(x) = p(x: Tj) can be assumed as the link function of a 
generalized linear model, so that (18) specializes to 

f(y;-y(xi,di )) i = l,...,n 

(25) iid 

di\P £ ~ P £ i = 1,..., n where P £ ~ e — NormCRM(p, kPq). 

This last formulation is convenient because it facilitates parameters interpretation as well as 
numerical posterior computation. 
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We analyze the Australian Institute of Sport (AIS) data set (Cook and Weisberg, 1994), 
which consists of 11 physical measurements on 202 athletes (100 females and 102 males). 
Here the response is the lean body mass (lbm), while three covariates are considered, the 
red cell count (rcc), the height in cm (Ht) and the weight in Kg (Wt). The data set is con¬ 
tained in the R package DPpackage (Jara et al., 2011). The actual model (25) we consider 
here is when /(•; p, rj 2 ) is the Gaussian distribution with // mean and rj 2 variance; moreover, 
/j = 7 (x,0) = x t 6, and the mixing measure P e is the e-NGG(n, o. Pq), as introduced in 
Argiento et al. (2015). We have considered two cases, when mixing the variance rj 2 with 
respect to the NGG process or when the variance i] 2 is given a parametric density; in both 
cases, by linearity of the mean x t 6, the model (here called linear dependent NGG mixture) 
can be interpreted as a NGG process mixture model, and inference can be achieved via an al¬ 
gorithm similar to that in Section 5. We set £ = 10 -6 , cr € {0.001,0.125,0.25}, and k such that 
E (K n ) ~ 5 or 10. When the variance rj 2 is included in the location points of the s-NGG pro¬ 
cess, then Po is Ah (bo. E ( ,) x inv-gamma ( uq /2 , vorfi/2); on the other hand, when rj 2 is given a 
parametric density, then rj 2 ~in v-gamma (vq / 2 , //() //{/ 2 ). We fixed hyperparameters in agree¬ 
ment with the least squares estimate: bo = (“50,5,0,0), So = diag( 100,10,10,10), uq = 4, 
r )o = l- For all the experiments, we computed the posterior of the number of groups, the 
predictive densities at different values of the covariate vectors and the cluster estimate via 
posterior maximization of Binder's loss function (see Lau and Green, 2007). Moreover, we 
compared the different prior settings computing predictive goodness-of-fit tools, specifically 
log pseudo-marginal likelihood (LPML) and the sum of squared errors (SSE), as introduced 
in Section 6.2. The minimum value of SSE, among our experiments, was achieved when rj 2 is 
included in the location of the e-NGG process, a = 0.001 and k = 0.8 so that E (K n ) ~ 5. On 
the other hand, the optimal LPML was achieved when a = 0.125, k = 0.4, and E( K n ) ~ 5. 
Posterior of K n and cluster estimate under this last hyperparameter setting are in Figure 5 
((a) and ( b ), respectively); in particular the cluster estimate is displayed in the scatterplot of 
the Wt vs lbm. In spite of the vague prior, the posterior of K n is almost degenerate on 2, 
giving evidence to the existence of two linear relationships between lbm and Wt. 

Finally, Figure 6 displays predictive densities and 95% credibility bands for 3 athletes, a 
female (Wt=60, rcc=3.9, Ht=176 and lbm=53.71), and two males (Wt=67.1,113.7, rcc=5.34,5.17, 
Ht=178.6, 209.4 and lbm=62,97, respectively); the dashed lines are observed values of the re¬ 
sponse. Depending on the value of the covariate, the distribution shows one or two peaks: 
this reflects the dependence of the grouping of the data on the value of x. This figure high¬ 
lights the versatility of nonparametric priors in a linear regression setting with respect to the 
customary parametric priors: indeed, the model is able to capture in detail the behavior of 
the data, even when several clusters are present. 
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Figure 5: Posterior distributions of the number K n of groups (a) and cluster estimate (b) 
under the linear dependent e —NGG mixture. 





(a) 


(b) 


(c) 


Figure 6: Predictive distributions of lbm for three different athletes: Wt=60, rcc=3.9 / Flt=176 
(a), Wt=67.1, rcc=5.34, Flt=178.6 (b), Wt=113.7, rcc=5.17, Ht=209.4 (c). The shaded area is 
the predictive 95% pointwise credible interval, while the dashed vertical line denotes the 
observed value of the response. 
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8 Discussion 

We have proposed a new model for density and cluster estimation in the Bayesian non- 
parametric framework. In particular, a finite dimensional process, the e-NormCRM, has 
been defined, which converges in distribution to the corresponding normalized completely 
random measure, when e tends to 0. Here, the e-NormCRM is the mixing measure in a 
mixture model. In this paper we have fixed £ very small, but we could choose a prior for 
e and include this parameter into the Gibbs sampler scheme. Among the achievements of 
the work, we have generalized all the theoretical results obtained in the special case of NGG 
in Argiento et al. (2015), including the expression of the eppf for an e-NormCRM process, 
its convergence to the corresponding eppf of the nonparametric underlying process and 
the posterior characterization of P e . Moreover, we have provided a general Gibbs Sampler 
scheme to sample from the posterior of the mixture model. To show the performance of our 
algorithm and the flexibility of the model, we have illustrated two examples via normalized 
completely random measure mixtures: in the first application, we have introduced a new 
normalized completely random measure, named normalized Bessel random measure; we 
have studied its theoretical properties and used it as the mixing measure in a model to fit 
simulated and real datasets. The second example we have dealt with is a linear dependent 
e-NGG mixture, where the dependence lies on the support points of the mixing random 
probability, to fit a well known dataset. Current and future research is devoted on the use 
of our approximation on more complex dependence structures. 
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APPENDIX: DETAILS ON FULL-CONDITIONALS FOR THE GIBBS SAMPLER 


Here, we provide some details about Step 3 of the Gibbs Sampler in Section 5. As far as 
Step 3a is concerned, the full-conditional C(e\u, 6) is obtained integrating out N £ (or equiv¬ 
alently N na ) from the law C(N e , u. 6), as follows: 


-boo 


C{e\u,0,Y)cx Y, £(N na ,£,u,0,Y) 


N na =0 

+oo 


( \ -A e ^E,u a (Nna + k ) yy f + °° 

Y e - jj—i —11/ KS e py s > ds 


N na =0 
/ ^ p+oo 


i= 1 ' 


I Ks ni e us p(s)ds e Ae ’ u Ae ^ £,u ^ + k ir(e) = f £ (u;ni, 


\i= 1 ' 


,n k ) vr(e), 


where we used the identity Y'u a (JA na + k)/(N na \) = e As ’ u (A £jU + k ), as previously 

noted. Moreover, f £ (u ; rii,...,«/,) is defined in (11). This step depends explicitly on the 
expression of p(s). 

Step 3.b consists in sampling from C(P s \e,u,0) and has already been described in the 
proof of Proposition 4. However, for a complete outline of the algorithm, we list the full- 
conditionals resulting into Step 3b: 

(i) . £(N na \e, Y,u, 0) = Vi(A £U ) + -— - V o(A £M ); this is formula (14). 

^■eu i rv J^-eu H" & 

(ii) . Non-allocated jumps: iid from £(J :j ) oc e~ uJj p( Jj)tr £j00 \(Jj), j = 1, •.. ,N na ; see the 

second factor of the last expression in (15). 

(iii) . Allocated jumps: iid from C{Ji*) oc Jj*e uJl i p(J;*)l( £ ,oo)(^*)/ * = 1, • • •, k; see the first 

factor of the last expression in (15). 


(iv). Non-allocated points of support: iid from P 0 ; see (19). 


(v). Allocated points of support: iid from £(r* ) oc {fl k(Xj-,Ti)}Po( T i), i. = 1,.... k; see 
(19). 
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